Improved Transformer for High-Resolution GANs: Supplementary Material
Zhao, Long
We provide more architecture and training details of the proposed HiT, as well as additional experimental results, to help better understand our paper. We report detailed results in Table 1 on ImageNet 128×128. "Pixel shuffle" indicates the pixel shuffle operation; "blocking" indicates the blocking operation producing non-overlapping feature blocks. We use TensorFlow for implementation. We provide a detailed description of the generative process of the proposed HiT in Algorithm 1; see Algorithm 3 for more details about blocking and unblocking.

MQA (multi-query attention) is identical to multi-head attention except that the different heads share a single set of keys and values. X and Y are blocked feature maps, where m is the number of blocks and n is the block sequence length:

    def multi_query_attention(X, Y, W_q, W_k, W_v, W_o):
      """Multi-query attention over blocked feature maps.

      Args:
        X: a tensor used as query with shape [b, m, n, d]
        Y: a tensor used as key and value with shape [b, m, n, d]
        W_q: a tensor projecting query with shape [h, d, k]
        W_k: a tensor projecting key with shape [d, k]
        W_v: a tensor projecting value with shape [d, v]
        W_o: a tensor projecting output with shape [h, d, v]
      Returns:
        Z: a tensor with shape [b, m, n, d]
      """
      Q = tf.einsum("bmnd,hdk->bhmnk", X, W_q)  # per-head queries
      K = tf.einsum("bmnd,dk->bmnk", Y, W_k)    # keys shared by all heads
      V = tf.einsum("bmnd,dv->bmnv", Y, W_v)    # values shared by all heads
      logits = tf.einsum("bhmnk,bmjk->bhmnj", Q, K)
      weights = tf.nn.softmax(logits)           # attention over key positions
      O = tf.einsum("bhmnj,bmjv->bhmnv", weights, V)
      Z = tf.einsum("bhmnv,hdv->bmnd", O, W_o)  # combine heads, project to d
      return Z
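Since the multi-query attention described above is defined entirely by einsum contractions over the stated shapes, it can be sanity-checked with a NumPy stand-in for `tf.einsum`. This is an illustrative sketch, not the paper's implementation: the toy dimensions, random weights, and the 1/sqrt(k) logit scaling are assumptions added here.

```python
import numpy as np

def multi_query_attention_np(X, Y, W_q, W_k, W_v, W_o):
    """NumPy sketch of multi-query attention over blocked features.

    All heads share one set of keys/values (W_k, W_v carry no head axis).
    Shapes: b=batch, m=#blocks, n=block length, h=heads,
    d=model dim, k=key dim, v=value dim.
    """
    Q = np.einsum("bmnd,hdk->bhmnk", X, W_q)            # per-head queries
    K = np.einsum("bmnd,dk->bmnk", Y, W_k)              # shared keys
    V = np.einsum("bmnd,dv->bmnv", Y, W_v)              # shared values
    logits = np.einsum("bhmnk,bmjk->bhmnj", Q, K) / np.sqrt(Q.shape[-1])
    weights = np.exp(logits - logits.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)           # softmax over keys
    O = np.einsum("bhmnj,bmjv->bhmnv", weights, V)
    return np.einsum("bhmnv,hdv->bmnd", O, W_o)         # back to model dim

# toy shapes: b=2, m=3 blocks of length n=4, h=2 heads, d=8, k=v=5
rng = np.random.default_rng(0)
b, m, n, h, d, k, v = 2, 3, 4, 2, 8, 5, 5
Z = multi_query_attention_np(
    rng.normal(size=(b, m, n, d)), rng.normal(size=(b, m, n, d)),
    rng.normal(size=(h, d, k)), rng.normal(size=(d, k)),
    rng.normal(size=(d, v)), rng.normal(size=(h, d, v)))
assert Z.shape == (b, m, n, d)
```

Because the key/value projections lack a head axis, the K and V tensors are computed once and broadcast across all h heads in the logit contraction, which is exactly the memory saving multi-query attention is designed for.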
A Additional Results
FID is evaluated over 10k samples instead of 50k for efficiency. It is thus important to compare our method's compute requirements to those of competing methods. Table 7 reports the throughput of our ImageNet models, measured in images per V100-second; we include communication time across two machines whenever our training batch size does not fit on a single machine. We find that a naive implementation of our models in PyTorch 1.7 is very inefficient, utilizing only a fraction of the available GPU compute. In addition, we can train for many fewer iterations while maintaining sample quality superior to BigGAN-deep with the same or lower compute budget.
EM Distillation for One-step Diffusion Models
Xie, Sirui, Xiao, Zhisheng, Kingma, Diederik P, Hou, Tingbo, Wu, Ying Nian, Murphy, Kevin Patrick, Salimans, Tim, Poole, Ben, Gao, Ruiqi
While diffusion models can learn complex distributions, sampling requires a computationally expensive iterative process. Existing distillation methods enable efficient sampling, but have notable limitations, such as performance degradation with very few sampling steps, reliance on training data access, or mode-seeking optimization that may fail to capture the full distribution. We propose EM Distillation (EMD), a maximum likelihood-based approach that distills a diffusion model to a one-step generator model with minimal loss of perceptual quality. Our approach is derived through the lens of Expectation-Maximization (EM), where the generator parameters are updated using samples from the joint distribution of the diffusion teacher prior and inferred generator latents. We develop a reparametrized sampling scheme and a noise cancellation technique that together stabilize the distillation process. We further reveal an interesting connection between our method and existing methods that minimize the mode-seeking KL. EMD outperforms existing one-step generative methods in terms of FID scores on ImageNet-64 and ImageNet-128, and compares favorably with prior work on distilling text-to-image diffusion models.
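The contrast the abstract draws between maximum-likelihood distillation and mode-seeking KL minimization can be made concrete numerically: minimizing the reverse KL(q||p) rewards a student q that locks onto a single mode of a multimodal teacher p, while the forward KL(p||q) heavily penalizes any mode q misses. The bimodal teacher and the two candidate students below are made-up illustrations, not EMD itself.

```python
import numpy as np

def kl(a, b, dx):
    # discrete approximation of KL(a || b) on a uniform grid
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / b[mask])) * dx

x = np.linspace(-10, 10, 4001)
dx = x[1] - x[0]
gauss = lambda mu, s: np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

p = 0.5 * gauss(-3, 0.7) + 0.5 * gauss(3, 0.7)  # bimodal "teacher"
q_mode = gauss(3, 0.7)                          # student covering one mode
q_wide = gauss(0, 3.2)                          # student spread over both modes

# Reverse KL (mode-seeking) prefers the single-mode student ...
assert kl(q_mode, p, dx) < kl(q_wide, p, dx)
# ... while forward KL (mass-covering) punishes the dropped mode.
assert kl(p, q_mode, dx) > kl(p, q_wide, dx)
```

The single-mode student pays only about log 2 under the reverse KL (it sits where p has half its mass) yet an enormous penalty under the forward KL, which is the "fail to capture the full distribution" failure mode the abstract refers to.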
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.96)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty (0.67)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.34)
Diffusion Models Beat GANs on Image Synthesis
Dhariwal, Prafulla, Nichol, Alex
We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models. We achieve this on unconditional image synthesis by finding a better architecture through a series of ablations. For conditional image synthesis, we further improve sample quality with classifier guidance: a simple, compute-efficient method for trading off diversity for sample quality using gradients from a classifier. We achieve an FID of 2.97 on ImageNet 128×128, 4.59 on ImageNet 256×256, and 7.72 on ImageNet 512×512, and we match BigGAN-deep even with as few as 25 forward passes per sample, all while maintaining better coverage of the distribution. Finally, we find that classifier guidance combines well with upsampling diffusion models, further improving FID to 3.85 on ImageNet 512×512. We release our code at https://github.com/openai/guided-diffusion.
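The classifier-guidance rule the abstract describes shifts each reverse-diffusion step's mean toward higher classifier probability, mu + s * Sigma * grad log p(y|x), where s is the guidance scale controlling the diversity/quality trade-off. The 1-D logistic "classifier" below is a made-up stand-in to show the mechanics of the mean shift; it is not the paper's classifier.

```python
import numpy as np

def classifier_logp_grad(x, w=2.0):
    # gradient of log p(y=1|x) for a toy logistic classifier sigmoid(w*x):
    # d/dx log sigmoid(w*x) = w * sigmoid(-w*x)
    return w / (1.0 + np.exp(w * x))

def guided_mean(mu, sigma2, s=1.0):
    # classifier guidance: shift the reverse-step mean by
    # s * Sigma * grad log p(y|x), evaluated here at the unguided mean
    return mu + s * sigma2 * classifier_logp_grad(mu)

mu, sigma2 = 0.0, 0.5
for s in (0.0, 1.0, 4.0):
    print("scale", s, "->", guided_mean(mu, sigma2, s))
```

With s = 0 the step is the ordinary unconditional update; raising s pushes samples further toward the classifier's preferred region (here, positive x), trading diversity for class-consistent quality exactly as described.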
- Research Report (0.64)
- Overview (0.46)